convolution block
Deeply Shared Filter Bases for Parameter-Efficient Convolutional Neural Networks
Modern convolutional neural networks (CNNs) have massive identical convolution blocks, and, hence, recursive sharing of parameters across these blocks has been proposed to reduce the amount of parameters. However, naive sharing of parameters poses many challenges such as limited representational power and the vanishing/exploding gradients problem of recursively shared parameters. In this paper, we present a recursive convolution block design and training method, in which a recursively shareable part, or a filter basis, is separated and learned while effectively avoiding the vanishing/exploding gradients problem during training. We show that the unwieldy vanishing/exploding gradients problem can be controlled by enforcing the elements of the filter basis orthonormal, and empirically demonstrate that the proposed orthogonality regularization improves the flow of gradients during training. Experimental results on image classification and object detection show that our approach, unlike previous parameter-sharing approaches, does not trade performance to save parameters and consistently outperforms overparameterized counterpart networks. This superior performance demonstrates that the proposed recursive convolution block design and the orthogonality regularization not only prevent performance degradation, but also consistently improve the representation capability while a significant amount of parameters are recursively shared.
DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
Zhou, Yifan, Ohkawa, Takehiko, Zhou, Guwenxiao, Goto, Kanoko, Hirose, Takumi, Sekikawa, Yusuke, Inoue, Nakamasa
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). T o handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. T o address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
Cardiac MRI Semantic Segmentation for Ventricles and Myocardium using Deep Learning
Mukisa, Racheal, Bansal, Arvind K.
Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost - effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardiovascular disease such as cardiomyopathy, valvular diseases, abnormalities related to septum perforations, and blood - flow rate . Semantic segmentation labels the CMR image at the pixel - level, and localizes its subcomponents to facilitate the detection of abnormalities, including abnormalities in cardiac wall motion in an aging heart with muscle abnormalities, vascular abnormalities, and valvular abnormalities. In this paper, we describe a model to improve semantic segmentation of CMR images. The model extracts edge - attributes and context information during down - sampling of the U - Net and infuses this information during up - sampling to localize three major cardiac structures: left ventricle cavity (LV); right ventricle cavity (RV); and LV myocardium (LMyo) . We present an algorithm and performance results. A comparison of our model with previous leading models, using similarity - metrics between actual image and segmented image, shows that our approach improves Dice s imilarity c oefficient (DSC) by 2% - 11% and lowers Hausdorff distance (HD) by 1.6 - 5.7 mm .
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Kim, Ji-Hoon, Choi, Jeongsoo, Kim, Jaehun, Jung, Chaeyoung, Chung, Joon Son
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.